{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pw6cvzTocw4G"
      },
      "source": [
        "# 使用 Cleanlab 检测文本数据集中的问题\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0yPBE0Xccw4J"
      },
      "source": [
        "作者: [Aravind Putrevu](https://huggingface.co/aravindputrevu)\n",
        "\n",
        "在这个 5 分钟的快速入门教程中，我们将使用 Cleanlab 检测一个由在线银行（文本）客户服务请求组成的意图分类数据集中的各种问题。我们考虑的是 [Banking77-OOS数据集](https://arxiv.org/abs/2106.04564) 的一个子集，包含 1,000 个客户服务请求，根据它们的意图被分类为 10 个类别（你可以在任何文本分类数据集上运行相同的代码）。[Cleanlab](https://github.com/cleanlab/cleanlab)自动识别我们数据集中的坏例子，包括错误标记的数据、范围外的示例（离群值）或其他模糊不清的示例。在深入建模你的数据之前，请考虑过滤或更正这样的坏例子！\n",
        "\n",
        "**本教程我们将要做的事情概述：**\n",
        "\n",
        "- 使用预训练的 transformer 模型从客户服务请求中提取文本嵌入\n",
        "\n",
        "- 在文本嵌入上训练一个简单的逻辑回归模型，以计算样本外的预测概率\n",
        "\n",
        "- 使用这些预测和嵌入运行 Cleanlab 的 `Datalab` 审核，以识别数据集中的问题，如：标签问题、离群值和近重复项。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "o__pRLFYcw4K"
      },
      "source": [
        "## 快速入门\n",
        "\n",
        "已经有一个模型在现有标签集上训练得到的（样本外）`pred_probs` 了吗？也许你还有一些数值`特征`？运行下面的代码来查找数据集中的任何潜在标签错误。\n",
        "\n",
        "**注意：** 如果在 Colab 上运行，可能需要使用 GPU（选择：Runtime > Change runtime type > Hardware accelerator > GPU）\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qaZA0cFs1fW4"
      },
      "outputs": [],
      "source": [
        "from cleanlab import Datalab\n",
        "\n",
        "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
        "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n",
        "\n",
        "lab.report()\n",
        "lab.get_issues()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dp4lpApmcw4K"
      },
      "source": [
        "## 安装需要的依赖\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DjoWBgGAcw4K"
      },
      "source": [
        "你可以使用 `pip` 按照以下方式安装本教程所需的所有包：\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fRsBIj3L_RUb"
      },
      "outputs": [],
      "source": [
        "!pip install -U scikit-learn sentence-transformers datasets\n",
        "!pip install -U \"cleanlab[datalab]\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:13.467211Z",
          "iopub.status.busy": "2024-02-16T06:26:13.466877Z",
          "iopub.status.idle": "2024-02-16T06:26:13.470222Z",
          "shell.execute_reply": "2024-02-16T06:26:13.469761Z"
        },
        "id": "zgezWF-2cw4L"
      },
      "outputs": [],
      "source": [
        "import re\n",
        "import string\n",
        "import pandas as pd\n",
        "from sklearn.metrics import accuracy_score, log_loss\n",
        "from sklearn.model_selection import cross_val_predict\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sentence_transformers import SentenceTransformer\n",
        "\n",
        "from cleanlab import Datalab"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:13.472374Z",
          "iopub.status.busy": "2024-02-16T06:26:13.471951Z",
          "iopub.status.idle": "2024-02-16T06:26:13.475065Z",
          "shell.execute_reply": "2024-02-16T06:26:13.474625Z"
        },
        "id": "mO3pnA1ncw4L",
        "nbsphinx": "hidden"
      },
      "outputs": [],
      "source": [
        "import random\n",
        "import numpy as np\n",
        "\n",
        "pd.set_option(\"display.max_colwidth\", None)\n",
        "\n",
        "SEED = 123456  # for reproducibility\n",
        "np.random.seed(SEED)\n",
        "random.seed(SEED)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yj_5JcO1cw4L"
      },
      "source": [
        "## 加载和格式化文本数据集"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:13.476949Z",
          "iopub.status.busy": "2024-02-16T06:26:13.476773Z",
          "iopub.status.idle": "2024-02-16T06:26:13.502278Z",
          "shell.execute_reply": "2024-02-16T06:26:13.501755Z"
        },
        "id": "HztO4qU9cw4L",
        "outputId": "c6ff9e95-6326-413e-a72f-6f3c05af1055"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 1000,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1000,\n        \"samples\": [\n          \"I made an international purchase, but the exchange rate was wrong\",\n          \"I would like to know why a withdraw I made for some cash shows up as pending.\",\n          \"I tried to get cash out of the ATM but it is taking too long\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 12,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 7,\n        \"samples\": [\n          11,\n          13,\n          46\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe",
              "variable_name": "data"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-bb25ddea-7d53-4ee3-bdc9-92ba5f185022\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>I am still waiting on my card?</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>What can I do if my card still hasn't arrived after 2 weeks?</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>I have been waiting over a week. Is the card still coming?</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Can I track my card while it is in the process of delivery?</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>How do I know if I will get my card, or if it is lost?</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-bb25ddea-7d53-4ee3-bdc9-92ba5f185022')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-bb25ddea-7d53-4ee3-bdc9-92ba5f185022 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-bb25ddea-7d53-4ee3-bdc9-92ba5f185022');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-51d80644-98e0-40b3-89a9-c326994791c6\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-51d80644-98e0-40b3-89a9-c326994791c6')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-51d80644-98e0-40b3-89a9-c326994791c6 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                           text  label\n",
              "0                                I am still waiting on my card?     11\n",
              "1  What can I do if my card still hasn't arrived after 2 weeks?     11\n",
              "2    I have been waiting over a week. Is the card still coming?     11\n",
              "3   Can I track my card while it is in the process of delivery?     11\n",
              "4        How do I know if I will get my card, or if it is lost?     11"
            ]
          },
          "execution_count": 24,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "dataset = load_dataset(\"PolyAI/banking77\", split=\"train\")\n",
        "data = pd.DataFrame(dataset[:1000])\n",
        "data.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:13.504463Z",
          "iopub.status.busy": "2024-02-16T06:26:13.504049Z",
          "iopub.status.idle": "2024-02-16T06:26:13.508243Z",
          "shell.execute_reply": "2024-02-16T06:26:13.507706Z"
        },
        "id": "Ujp0luqRcw4M",
        "outputId": "b438fed5-aa75-450d-dc84-0b3398960487"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "This dataset has 7 classes.\n",
            "Classes: {32, 34, 36, 11, 13, 46, 17}\n"
          ]
        }
      ],
      "source": [
        "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n",
        "num_classes = len(set(labels))\n",
        "\n",
        "print(f\"This dataset has {num_classes} classes.\")\n",
        "print(f\"Classes: {set(labels)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PVza57cecw4M"
      },
      "source": [
        "让我们查看数据集中的第 i 个示例："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:13.510435Z",
          "iopub.status.busy": "2024-02-16T06:26:13.510163Z",
          "iopub.status.idle": "2024-02-16T06:26:13.513358Z",
          "shell.execute_reply": "2024-02-16T06:26:13.512906Z"
        },
        "id": "lXHi90Kecw4M",
        "outputId": "af8a9b19-986f-44fe-c564-dd83e400309e"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Example Label: 11\n",
            "Example Text: What can I do if my card still hasn't arrived after 2 weeks?\n"
          ]
        }
      ],
      "source": [
        "i = 1  # change this to view other examples from the dataset\n",
        "print(f\"Example Label: {labels[i]}\")\n",
        "print(f\"Example Text: {raw_texts[i]}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JH7UU9Wscw4M"
      },
      "source": [
        "数据以两个 numpy 数组的形式存储：\n",
        "1. `raw_texts` 以文本格式存储客户服务请求的话语\n",
        "2. `labels` 存储每个示例的意图类别（标签）"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "T0d80apCcw4M"
      },
      "source": [
        "<div class=\"alert alert-info\">\n",
        "\n",
        "自有数据？\n",
        "\n",
        "你可以轻松地将上述内容替换为你自己的文本数据集，并继续进行教程的其余部分。\n",
        "\n",
        "\n",
        "</div>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YLDeD09Ncw4M"
      },
      "source": [
        "接下来，我们将文本字符串转换为更适合作为机器学习模型输入的向量。\n",
        "\n",
        "我们将使用预训练的 Transformer 模型提供的数值表示作为我们文本的嵌入。[Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) 库提供了计算文本数据嵌入的简单方法。在这里，我们加载了预训练的 `electra-small-discriminator` 模型，然后通过网络运行我们的数据，以提取每个示例的向量嵌入。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:13.515306Z",
          "iopub.status.busy": "2024-02-16T06:26:13.515126Z",
          "iopub.status.idle": "2024-02-16T06:26:18.244024Z",
          "shell.execute_reply": "2024-02-16T06:26:18.243354Z"
        },
        "id": "DbDb6Ni6cw4M"
      },
      "outputs": [],
      "source": [
        "transformer = SentenceTransformer('google/electra-small-discriminator')\n",
        "text_embeddings = transformer.encode(raw_texts)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Moz0KJvzcw4M"
      },
      "source": [
        "我们后续的机器学习模型将直接在 `text_embeddings` 的元素上操作，以便对客户服务请求进行分类。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4FK2Q72gcw4M"
      },
      "source": [
        "## 定义一个分类模型并计算样本外的预测概率"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yaicOGrhcw4N"
      },
      "source": [
        "   为了利用预训练网络进行特定的分类任务，一种典型的方法是添加一个线性输出层，并在新数据上微调网络参数。然而，这可能需要大量的计算资源。另一种方法是冻结网络的预训练权重，只训练输出层，而不依赖于 GPU。在这里，我们通过在提取的嵌入顶部拟合一个 scikit-learn 线性模型来方便地实现这一点。\n",
        "\n",
        "   为了识别标签问题，cleanlab 需要你的模型为每个数据点提供概率预测。然而，对于模型之前训练过的数据点，这些预测将是过拟合的（因此不可靠）。cleanlab 旨在仅与**样本外**的预测类概率一起使用，即在模型训练期间保持不变的数据点。\n",
        "\n",
        "   在这里，我们使用带有交叉验证的逻辑回归模型来获得数据集中每个示例的样本外预测类概率。\n",
        "   确保你的 `pred_probs` 列根据类的排序正确排序，对于 Datalab 来说，是：按类名字典顺序排序。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:18.247142Z",
          "iopub.status.busy": "2024-02-16T06:26:18.246652Z",
          "iopub.status.idle": "2024-02-16T06:26:19.133641Z",
          "shell.execute_reply": "2024-02-16T06:26:19.132953Z"
        },
        "id": "tiIqp1arcw4N",
        "scrolled": true
      },
      "outputs": [],
      "source": [
        "model = LogisticRegression(max_iter=400)\n",
        "\n",
        "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9s0pcMk1cw4N"
      },
      "source": [
        "## 使用 Cleanlab 查找数据集中的问题"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa8ltsx9cw4N"
      },
      "source": [
        "在给定来自你拥有的任何模型的特征嵌入和（样本外）预测类概率的情况下，cleanlab 可以帮助你快速识别数据中的低质量示例。\n",
        "\n",
        "在这里，我们使用 Cleanlab 的 `Datalab` 来查找数据中的问题。Datalab 提供了几种加载数据的方式；我们将简单地在字典中包装训练特征和噪声标签。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:19.136722Z",
          "iopub.status.busy": "2024-02-16T06:26:19.136482Z",
          "iopub.status.idle": "2024-02-16T06:26:19.139419Z",
          "shell.execute_reply": "2024-02-16T06:26:19.138870Z"
        },
        "id": "UNj4rWW2cw4N"
      },
      "outputs": [],
      "source": [
        "data_dict = {\"texts\": raw_texts, \"labels\": labels}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IpNmBc_Lcw4N"
      },
      "source": [
        "审核你的数据所需的全部操作就是调用 `find_issues()`。我们传入上面获得的预测概率和特征嵌入，但你不一定需要提供所有这些信息，具体取决于你对哪些类型的问题感兴趣。你提供的输入越多，`Datalab` 就能在你的数据中检测到更多类型的问题。使用更好的模型来生成这些输入将确保 cleanlab 更准确地估计问题。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:19.141893Z",
          "iopub.status.busy": "2024-02-16T06:26:19.141673Z",
          "iopub.status.idle": "2024-02-16T06:26:20.809087Z",
          "shell.execute_reply": "2024-02-16T06:26:20.808461Z"
        },
        "id": "R0xuUDRWcw4N",
        "scrolled": true
      },
      "outputs": [],
      "source": [
        "lab = Datalab(data_dict, label_name=\"labels\")\n",
        "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d6Iqy0vGq7w9"
      },
      "source": [
        "输出看起来如下：\n",
        "\n",
        "```bash\n",
        "Finding null issues ...\n",
        "Finding label issues ...\n",
        "Finding outlier issues ...\n",
        "Fitting OOD estimator based on provided features ...\n",
        "Finding near_duplicate issues ...\n",
        "Finding non_iid issues ...\n",
        "Finding class_imbalance issues ...\n",
        "Finding underperforming_group issues ...\n",
        "\n",
        "Audit complete. 62 issues found in the dataset.\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4aitesJccw4N"
      },
      "source": [
        "审计完成后，使用 `report` 方法来查看审计结果。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.813057Z",
          "iopub.status.busy": "2024-02-16T06:26:20.811515Z",
          "iopub.status.idle": "2024-02-16T06:26:20.838760Z",
          "shell.execute_reply": "2024-02-16T06:26:20.838088Z"
        },
        "id": "ALXu32nzcw4N",
        "outputId": "733d2ed4-5bcd-49e6-93a7-285f3d66278c",
        "scrolled": true
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Here is a summary of the different kinds of issues found in the data:\n",
            "\n",
            "    issue_type  num_issues\n",
            "       outlier          37\n",
            "near_duplicate          14\n",
            "         label          10\n",
            "       non_iid           1\n",
            "\n",
            "Dataset Information: num_examples: 1000, num_classes: 7\n",
            "\n",
            "\n",
            "---------------------- outlier issues ----------------------\n",
            "\n",
            "About this issue:\n",
            "\tExamples that are very different from the rest of the dataset \n",
            "    (i.e. potentially out-of-distribution or rare/anomalous instances).\n",
            "    \n",
            "\n",
            "Number of examples with this issue: 37\n",
            "Overall dataset quality in terms of this issue: 0.3671\n",
            "\n",
            "Examples representing most severe instances of this issue:\n",
            "     is_outlier_issue  outlier_score\n",
            "791              True       0.024866\n",
            "601              True       0.031162\n",
            "863              True       0.060738\n",
            "355              True       0.064199\n",
            "157              True       0.065075\n",
            "\n",
            "\n",
            "------------------ near_duplicate issues -------------------\n",
            "\n",
            "About this issue:\n",
            "\tA (near) duplicate issue refers to two or more examples in\n",
            "    a dataset that are extremely similar to each other, relative\n",
            "    to the rest of the dataset.  The examples flagged with this issue\n",
            "    may be exactly duplicated, or lie atypically close together when\n",
            "    represented as vectors (i.e. feature embeddings).\n",
            "    \n",
            "\n",
            "Number of examples with this issue: 14\n",
            "Overall dataset quality in terms of this issue: 0.5961\n",
            "\n",
            "Examples representing most severe instances of this issue:\n",
            "     is_near_duplicate_issue  near_duplicate_score near_duplicate_sets  distance_to_nearest_neighbor\n",
            "459                     True              0.009544               [429]                      0.000566\n",
            "429                     True              0.009544               [459]                      0.000566\n",
            "501                     True              0.046044          [412, 517]                      0.002781\n",
            "412                     True              0.046044               [501]                      0.002781\n",
            "698                     True              0.054626               [607]                      0.003314\n",
            "\n",
            "\n",
            "----------------------- label issues -----------------------\n",
            "\n",
            "About this issue:\n",
            "\tExamples whose given label is estimated to be potentially incorrect\n",
            "    (e.g. due to annotation error) are flagged as having label issues.\n",
            "    \n",
            "\n",
            "Number of examples with this issue: 10\n",
            "Overall dataset quality in terms of this issue: 0.9930\n",
            "\n",
            "Examples representing most severe instances of this issue:\n",
            "     is_label_issue  label_score  given_label  predicted_label\n",
            "379           False     0.025486           32               11\n",
            "100           False     0.032102           11               36\n",
            "300           False     0.037742           32               46\n",
            "485            True     0.057666           17               34\n",
            "159            True     0.059408           13               11\n",
            "\n",
            "\n",
            "---------------------- non_iid issues ----------------------\n",
            "\n",
            "About this issue:\n",
            "\tWhether the dataset exhibits statistically significant\n",
            "    violations of the IID assumption like:\n",
            "    changepoints or shift, drift, autocorrelation, etc.\n",
            "    The specific violation considered is whether the\n",
            "    examples are ordered such that almost adjacent examples\n",
            "    tend to have more similar feature values.\n",
            "    \n",
            "\n",
            "Number of examples with this issue: 1\n",
            "Overall dataset quality in terms of this issue: 0.0000\n",
            "\n",
            "Examples representing most severe instances of this issue:\n",
            "     is_non_iid_issue  non_iid_score\n",
            "988              True       0.563774\n",
            "975             False       0.570179\n",
            "997             False       0.571891\n",
            "967             False       0.572357\n",
            "956             False       0.577413\n",
            "\n",
            "Additional Information: \n",
            "p-value: 0.0\n"
          ]
        }
      ],
      "source": [
        "lab.report()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sAuLE6Macw4N"
      },
      "source": [
        "### 标签问题\n",
        "\n",
        "报告显示 cleanlab 在我们的数据集中识别出了许多标签问题。我们可以使用 `get_issues` 方法来查看哪些示例被标记为可能标签错误，以及每个示例的标签质量分数，通过指定 `label` 作为参数来关注数据中的标签问题。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.843083Z",
          "iopub.status.busy": "2024-02-16T06:26:20.842045Z",
          "iopub.status.idle": "2024-02-16T06:26:20.852505Z",
          "shell.execute_reply": "2024-02-16T06:26:20.852016Z"
        },
        "id": "6gATaXWscw4N",
        "outputId": "0d0e70c5-1548-4fe6-b67e-668c8dfedf0e",
        "scrolled": true
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"label_issues\",\n  \"rows\": 1000,\n  \"fields\": [\n    {\n      \"column\": \"is_label_issue\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          true,\n          false\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label_score\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.2150390046430028,\n        \"min\": 0.025486333476725527,\n        \"max\": 0.999751760644687,\n        \"num_unique_values\": 1000,\n        \"samples\": [\n          0.98954913626076,\n          0.44264330724848383\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"given_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 12,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 7,\n        \"samples\": [\n          11,\n          13\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"predicted_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 12,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 7,\n        \"samples\": [\n          11,\n          13\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe",
              "variable_name": "label_issues"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-5a26d9e1-9d7e-4327-9c38-a2fa12e59f28\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>is_label_issue</th>\n",
              "      <th>label_score</th>\n",
              "      <th>given_label</th>\n",
              "      <th>predicted_label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>False</td>\n",
              "      <td>0.903926</td>\n",
              "      <td>11</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>False</td>\n",
              "      <td>0.860544</td>\n",
              "      <td>11</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>False</td>\n",
              "      <td>0.658309</td>\n",
              "      <td>11</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>False</td>\n",
              "      <td>0.697085</td>\n",
              "      <td>11</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>False</td>\n",
              "      <td>0.434934</td>\n",
              "      <td>11</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5a26d9e1-9d7e-4327-9c38-a2fa12e59f28')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-5a26d9e1-9d7e-4327-9c38-a2fa12e59f28 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-5a26d9e1-9d7e-4327-9c38-a2fa12e59f28');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-4dd98a1c-b39c-464d-af07-2f385d2054b1\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-4dd98a1c-b39c-464d-af07-2f385d2054b1')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-4dd98a1c-b39c-464d-af07-2f385d2054b1 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "   is_label_issue  label_score  given_label  predicted_label\n",
              "0           False     0.903926           11               11\n",
              "1           False     0.860544           11               11\n",
              "2           False     0.658309           11               11\n",
              "3           False     0.697085           11               11\n",
              "4           False     0.434934           11               11"
            ]
          },
          "execution_count": 32,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "label_issues = lab.get_issues(\"label\")\n",
        "label_issues.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eBLFyMMcs5NT"
      },
      "source": [
        "| | is_label_issue | label_score | given_label | predicted_label |\n",
        "|----------------|-------------|-------------|-----------------|-----------------|\n",
        "| 0              | False       | 0.903926    | 11              | 11 |\n",
        "| 1              | False       | 0.860544    | 11              | 11 |\n",
        "| 2              | False       | 0.658309    | 11              | 11 |\n",
        "| 3              | False       | 0.697085    | 11              | 11 |\n",
        "| 4              | False       | 0.434934    | 11              | 11 |\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-tYlhmKYcw4N"
      },
      "source": [
        "此方法返回一个包含每个示例的标签质量分数的数据框。这些数值分数介于 0 和 1 之间，其中较低的分数表示更可能是错误标记的示例。数据框还包含一个布尔列，指定是否将每个示例识别为具有标签问题（表明它可能是错误标记的）。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XcD-oCLlcw4N"
      },
      "source": [
        "我们可以获取标记有标签问题的示例的子集，并且还可以按标签质量分数排序，以找到数据集中最可能错误标记的 5 个示例的索引。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.854743Z",
          "iopub.status.busy": "2024-02-16T06:26:20.854394Z",
          "iopub.status.idle": "2024-02-16T06:26:20.858961Z",
          "shell.execute_reply": "2024-02-16T06:26:20.858409Z"
        },
        "id": "QtloV-NBcw4N",
        "outputId": "86c32e99-7dc8-470c-b102-f0f5acc13855"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "cleanlab found 10 potential label errors in the dataset.\n",
            "Here are indices of the top 5 most likely errors: \n",
            " [379 100 300 485 159]\n"
          ]
        }
      ],
      "source": [
        "identified_label_issues = label_issues[label_issues[\"is_label_issue\"] == True]\n",
        "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n",
        "\n",
        "print(\n",
        "    f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n",
        "    f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8J49bTeocw4N"
      },
      "source": [
        "让我们查看一些最可能的标签错误。\n",
        "\n",
        "这里我们展示了数据集中被识别为最可能的标签错误的前 5 个示例，以及它们的给定（原始）标签和 cleanlab 提供的建议替代标签。\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 276
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.861048Z",
          "iopub.status.busy": "2024-02-16T06:26:20.860742Z",
          "iopub.status.idle": "2024-02-16T06:26:20.867443Z",
          "shell.execute_reply": "2024-02-16T06:26:20.866904Z"
        },
        "id": "c-niFVJvcw4N",
        "outputId": "5bbc5217-3581-4e2e-8b56-7a1fc77cc427"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"data_with_suggested_labels\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"can you share card tracking number?\",\n          \"Is there any way to see my card in the app?\",\n          \"If I need to cash foreign transfers, how does that work?\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"given_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 10,\n        \"min\": 11,\n        \"max\": 32,\n        \"num_unique_values\": 4,\n        \"samples\": [\n          11,\n          13,\n          32\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"suggested_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 15,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 4,\n        \"samples\": [\n          36,\n          34,\n          11\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-b3e89f50-5593-4907-a4ca-3b41c4a5e72c\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>given_label</th>\n",
              "      <th>suggested_label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>379</th>\n",
              "      <td>Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from?</td>\n",
              "      <td>32</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>100</th>\n",
              "      <td>can you share card tracking number?</td>\n",
              "      <td>11</td>\n",
              "      <td>36</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>300</th>\n",
              "      <td>If I need to cash foreign transfers, how does that work?</td>\n",
              "      <td>32</td>\n",
              "      <td>46</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>485</th>\n",
              "      <td>Was I charged more than I should of been for a currency exchange?</td>\n",
              "      <td>17</td>\n",
              "      <td>34</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>159</th>\n",
              "      <td>Is there any way to see my card in the app?</td>\n",
              "      <td>13</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b3e89f50-5593-4907-a4ca-3b41c4a5e72c')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-b3e89f50-5593-4907-a4ca-3b41c4a5e72c button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-b3e89f50-5593-4907-a4ca-3b41c4a5e72c');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-8e310624-5a8c-4409-96e1-9757bd48d51a\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-8e310624-5a8c-4409-96e1-9757bd48d51a')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-8e310624-5a8c-4409-96e1-9757bd48d51a button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                                                                          text  \\\n",
              "379  Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from?   \n",
              "100                                                                        can you share card tracking number?   \n",
              "300                                                   If I need to cash foreign transfers, how does that work?   \n",
              "485                                          Was I charged more than I should of been for a currency exchange?   \n",
              "159                                                                Is there any way to see my card in the app?   \n",
              "\n",
              "     given_label  suggested_label  \n",
              "379           32               11  \n",
              "100           11               36  \n",
              "300           32               46  \n",
              "485           17               34  \n",
              "159           13               11  "
            ]
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "data_with_suggested_labels = pd.DataFrame(\n",
        "    {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n",
        ")\n",
        "data_with_suggested_labels.iloc[lowest_quality_labels]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g2dvMySPtkbL"
      },
      "source": [
        "上面命令的输出如下所示：\n",
        "  \n",
        "|      | text                                                                                                      | given_label    | suggested_label |\n",
        "|------|-----------------------------------------------------------------------------------------------------------|----------------|-----------------|\n",
        "| 379  | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32             | 11              |\n",
        "| 100  | can you share card tracking number?                                                                       | 11             | 36              |\n",
        "| 300  | If I need to cash foreign transfers, how does that work?                                                  | 32             | 46              |\n",
        "| 485  | Was I charged more than I should of been for a currency exchange?                                         | 17             | 34              |\n",
        "| 159  | Is there any way to see my card in the app?                                                               | 13             | 11              |\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eH8ltGj0cw4O",
        "scrolled": true
      },
      "source": [
        "这些是 cleanlab 在此数据中识别的非常清晰的标签错误！请注意，`given_label` 并没有正确反映这些请求的意图，无论谁制作了这个数据集，在建模数据之前都需要解决许多错误。\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ULFeD3bzcw4O"
      },
      "source": [
        "### 离群值问题\n",
        "\n",
        "根据报告，我们的数据集中包含一些离群值。\n",
        "\n",
        "我们可以通过 `get_issues` 查看哪些示例是离群值（以及一个数值质量分数，量化每个示例看起来有多么典型）。我们将结果数据框按照 cleanlab 的离群值质量分数排序，以查看数据集中最严重的离群值。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.869718Z",
          "iopub.status.busy": "2024-02-16T06:26:20.869251Z",
          "iopub.status.idle": "2024-02-16T06:26:20.876386Z",
          "shell.execute_reply": "2024-02-16T06:26:20.875851Z"
        },
        "id": "jBLuqUXBcw4O",
        "outputId": "d5d2dbc6-c708-4750-e3ea-6dcd5c24a64d"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"outlier_issues\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"is_outlier_issue\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          true\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"outlier_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          0.03116183541715145\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-b302658a-5cf7-4fcc-a69d-150ef156f2ce\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>is_outlier_issue</th>\n",
              "      <th>outlier_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>791</th>\n",
              "      <td>True</td>\n",
              "      <td>0.024866</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>601</th>\n",
              "      <td>True</td>\n",
              "      <td>0.031162</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>863</th>\n",
              "      <td>True</td>\n",
              "      <td>0.060738</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>355</th>\n",
              "      <td>True</td>\n",
              "      <td>0.064199</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>157</th>\n",
              "      <td>True</td>\n",
              "      <td>0.065075</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b302658a-5cf7-4fcc-a69d-150ef156f2ce')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-b302658a-5cf7-4fcc-a69d-150ef156f2ce button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-b302658a-5cf7-4fcc-a69d-150ef156f2ce');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-6b51f708-0b0d-469d-b68a-156d487823c5\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-6b51f708-0b0d-469d-b68a-156d487823c5')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-6b51f708-0b0d-469d-b68a-156d487823c5 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "     is_outlier_issue  outlier_score\n",
              "791              True       0.024866\n",
              "601              True       0.031162\n",
              "863              True       0.060738\n",
              "355              True       0.064199\n",
              "157              True       0.065075"
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "outlier_issues = lab.get_issues(\"outlier\")\n",
        "outlier_issues.sort_values(\"outlier_score\").head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F7Z2VJQAujui"
      },
      "source": [
        "输出如下所示：\n",
        "\n",
        "|   | is_outlier_issue | outlier_score |\n",
        "|---| ----------------|---------------|\n",
        "| 791 | True             | 0.024866      |\n",
        "| 601 | True             | 0.031162      |\n",
        "| 863 | True             | 0.060738      |\n",
        "| 355 | True             | 0.064199      |\n",
        "| 157 | True             | 0.065075      |"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 246
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.878435Z",
          "iopub.status.busy": "2024-02-16T06:26:20.878117Z",
          "iopub.status.idle": "2024-02-16T06:26:20.884073Z",
          "shell.execute_reply": "2024-02-16T06:26:20.883533Z"
        },
        "id": "Kjn-muLGcw4O",
        "outputId": "a5ae0a32-cac4-442d-89fc-8f7f64da9dfc"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"$1 charge in transaction.\",\n          \"lost card found, want to put it back in app\",\n          \"My atm withdraw is stillpending\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 13,\n        \"min\": 13,\n        \"max\": 46,\n        \"num_unique_values\": 4,\n        \"samples\": [\n          34,\n          13,\n          46\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-30ae3a5c-a5ac-48e2-be07-1bc96f12dd24\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>791</th>\n",
              "      <td>withdrawal pending meaning?</td>\n",
              "      <td>46</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>601</th>\n",
              "      <td>$1 charge in transaction.</td>\n",
              "      <td>34</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>863</th>\n",
              "      <td>My atm withdraw is stillpending</td>\n",
              "      <td>46</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>355</th>\n",
              "      <td>explain the interbank exchange rate</td>\n",
              "      <td>32</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>157</th>\n",
              "      <td>lost card found, want to put it back in app</td>\n",
              "      <td>13</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-30ae3a5c-a5ac-48e2-be07-1bc96f12dd24')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-30ae3a5c-a5ac-48e2-be07-1bc96f12dd24 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-30ae3a5c-a5ac-48e2-be07-1bc96f12dd24');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-bddbc2a7-3f55-4d4b-9f5b-bad055b943b6\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-bddbc2a7-3f55-4d4b-9f5b-bad055b943b6')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-bddbc2a7-3f55-4d4b-9f5b-bad055b943b6 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                            text  label\n",
              "791                  withdrawal pending meaning?     46\n",
              "601                    $1 charge in transaction.     34\n",
              "863              My atm withdraw is stillpending     46\n",
              "355          explain the interbank exchange rate     32\n",
              "157  lost card found, want to put it back in app     13"
            ]
          },
          "execution_count": 34,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n",
        "\n",
        "data.iloc[lowest_quality_outliers]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kuZMsLPZYARL"
      },
      "source": [
        "对于质量最低的离群值，样本输出将如下所示：\n",
        "\n",
        "|index|text|label|\n",
        "|---|---|---|\n",
        "|791|withdrawal pending meaning?|46|\n",
        "|601|$1 charge in transaction\\.|34|\n",
        "|863|My atm withdraw is stillpending|46|\n",
        "|355|explain the interbank exchange rate|32|\n",
        "|157|lost card found, want to put it back in app|13|\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sBal-KDrcw4R"
      },
      "source": [
        "我们看到 cleanlab 已经识别出这个数据集中的条目，这些条目看起来并不是正确的客户请求。此数据集中的离群值似乎是不在范围内的客户请求和其他对意图分类没有意义的非语义文本。仔细考虑这些离群值是否可能对你的数据建模产生不利影响，如果有可能的话，考虑从数据集中移除它们。\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ch71b_0qcw4S"
      },
      "source": [
        "### 近重复问题\n",
        "\n",
        "根据报告，我们的数据集中包含一些几乎重复的示例集。\n",
        "我们可以通过 `get_issues` 查看哪些示例是（几乎）重复的（以及一个数值质量分数，量化每个示例与数据集中最近邻的相似程度）。我们将结果数据框按照 cleanlab 的近重复质量分数排序，以查看数据集中最接近重复的文本示例。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 226
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.886079Z",
          "iopub.status.busy": "2024-02-16T06:26:20.885805Z",
          "iopub.status.idle": "2024-02-16T06:26:20.894466Z",
          "shell.execute_reply": "2024-02-16T06:26:20.893919Z"
        },
        "id": "TbI49Rdccw4S",
        "outputId": "1978cdb5-02c2-4f82-e7d5-553ad1b6dca9"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"duplicate_issues\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"is_near_duplicate_issue\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          true\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"near_duplicate_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.00954437255859375\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"near_duplicate_sets\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"distance_to_nearest_neighbor\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.0013286758192588926,\n        \"min\": 0.0005658268928527832,\n        \"max\": 0.0033143162727355957,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.0005658268928527832\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>is_near_duplicate_issue</th>\n",
              "      <th>near_duplicate_score</th>\n",
              "      <th>near_duplicate_sets</th>\n",
              "      <th>distance_to_nearest_neighbor</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>459</th>\n",
              "      <td>True</td>\n",
              "      <td>0.009544</td>\n",
              "      <td>[429]</td>\n",
              "      <td>0.000566</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>429</th>\n",
              "      <td>True</td>\n",
              "      <td>0.009544</td>\n",
              "      <td>[459]</td>\n",
              "      <td>0.000566</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>501</th>\n",
              "      <td>True</td>\n",
              "      <td>0.046044</td>\n",
              "      <td>[412, 517]</td>\n",
              "      <td>0.002781</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>412</th>\n",
              "      <td>True</td>\n",
              "      <td>0.046044</td>\n",
              "      <td>[501]</td>\n",
              "      <td>0.002781</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>698</th>\n",
              "      <td>True</td>\n",
              "      <td>0.054626</td>\n",
              "      <td>[607]</td>\n",
              "      <td>0.003314</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-87faf38c-7402-4ae8-b190-5cecbc434665\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-87faf38c-7402-4ae8-b190-5cecbc434665')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-87faf38c-7402-4ae8-b190-5cecbc434665 button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "     is_near_duplicate_issue  near_duplicate_score near_duplicate_sets  \\\n",
              "459                     True              0.009544               [429]   \n",
              "429                     True              0.009544               [459]   \n",
              "501                     True              0.046044          [412, 517]   \n",
              "412                     True              0.046044               [501]   \n",
              "698                     True              0.054626               [607]   \n",
              "\n",
              "     distance_to_nearest_neighbor  \n",
              "459                      0.000566  \n",
              "429                      0.000566  \n",
              "501                      0.002781  \n",
              "412                      0.002781  \n",
              "698                      0.003314  "
            ]
          },
          "execution_count": 35,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "duplicate_issues = lab.get_issues(\"near_duplicate\")\n",
        "duplicate_issues.sort_values(\"near_duplicate_score\").head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EawP0y1Lcw4S"
      },
      "source": [
        "上面的结果显示了 cleanlab 认为哪些示例是近重复的（`is_near_duplicate_issue == True` 的行）。在这里，我们看到示例 459 和 429 是近重复的，示例 501 和 412 也是近重复的。\n",
        "\n",
        "让我们查看这些示例，看看它们有多么相似。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 182
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.896501Z",
          "iopub.status.busy": "2024-02-16T06:26:20.896175Z",
          "iopub.status.idle": "2024-02-16T06:26:20.901983Z",
          "shell.execute_reply": "2024-02-16T06:26:20.901420Z"
        },
        "id": "0TEW5igFcw4S",
        "outputId": "86343985-26bb-44ce-f27b-610357f43030"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 2,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"I purchased something overseas and the incorrect exchange rate was applied.\",\n          \"I purchased something abroad and the incorrect exchange rate was applied.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0,\n        \"min\": 17,\n        \"max\": 17,\n        \"num_unique_values\": 1,\n        \"samples\": [\n          17\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-262b7208-d7d7-4fba-9d6c-7bda298822bc\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>459</th>\n",
              "      <td>I purchased something abroad and the incorrect exchange rate was applied.</td>\n",
              "      <td>17</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>429</th>\n",
              "      <td>I purchased something overseas and the incorrect exchange rate was applied.</td>\n",
              "      <td>17</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-262b7208-d7d7-4fba-9d6c-7bda298822bc')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-262b7208-d7d7-4fba-9d6c-7bda298822bc button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-262b7208-d7d7-4fba-9d6c-7bda298822bc');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-af242cbb-1eba-4356-88ea-3d159158902a\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-af242cbb-1eba-4356-88ea-3d159158902a')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-af242cbb-1eba-4356-88ea-3d159158902a button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                                            text  \\\n",
              "459    I purchased something abroad and the incorrect exchange rate was applied.   \n",
              "429  I purchased something overseas and the incorrect exchange rate was applied.   \n",
              "\n",
              "     label  \n",
              "459     17  \n",
              "429     17  "
            ]
          },
          "execution_count": 38,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "data.iloc[[459, 429]]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DoAyD-FZpsSm"
      },
      "source": [
        "样本输出：\n",
        "\n",
        "|index|text|label|\n",
        "|---|---|---|\n",
        "|459|I purchased something abroad and the incorrect exchange rate was applied\\.|17|\n",
        "|429|I purchased something overseas and the incorrect exchange rate was applied\\.|17|"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 198
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.904159Z",
          "iopub.status.busy": "2024-02-16T06:26:20.903821Z",
          "iopub.status.idle": "2024-02-16T06:26:20.909681Z",
          "shell.execute_reply": "2024-02-16T06:26:20.909160Z"
        },
        "id": "VnbIBYaHcw4S",
        "outputId": "8b00bb96-0d9d-43f6-b85f-c41e437d41b5"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 2,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"The exchange rate you are using is bad.This can't be the official interbank exchange rate.\",\n          \"The exchange rate you are using is really bad.This can't be the official interbank exchange rate.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0,\n        \"min\": 17,\n        \"max\": 17,\n        \"num_unique_values\": 1,\n        \"samples\": [\n          17\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
              "type": "dataframe"
            },
            "text/html": [
              "\n",
              "  <div id=\"df-85c08b5b-da74-4092-97e6-f8908702eeaf\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>501</th>\n",
              "      <td>The exchange rate you are using is really bad.This can't be the official interbank exchange rate.</td>\n",
              "      <td>17</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>412</th>\n",
              "      <td>The exchange rate you are using is bad.This can't be the official interbank exchange rate.</td>\n",
              "      <td>17</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-85c08b5b-da74-4092-97e6-f8908702eeaf')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-85c08b5b-da74-4092-97e6-f8908702eeaf button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-85c08b5b-da74-4092-97e6-f8908702eeaf');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-74bfcb88-0e60-4aec-a441-90ce1d28370b\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-74bfcb88-0e60-4aec-a441-90ce1d28370b')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-74bfcb88-0e60-4aec-a441-90ce1d28370b button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                                                                  text  \\\n",
              "501  The exchange rate you are using is really bad.This can't be the official interbank exchange rate.   \n",
              "412         The exchange rate you are using is bad.This can't be the official interbank exchange rate.   \n",
              "\n",
              "     label  \n",
              "501     17  \n",
              "412     17  "
            ]
          },
          "execution_count": 39,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "data.iloc[[501, 412]]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Y4QD35-dqeGg"
      },
      "source": [
        "样本输出：\n",
        "\n",
        "|index|text|label|\n",
        "|---|---|---|\n",
        "|501|The exchange rate you are using is really bad\\.This can't be the official interbank exchange rate\\.|17|\n",
        "|412|The exchange rate you are using is bad\\.This can't be the official interbank exchange rate\\.|17|"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UG8xfTa5cw4S"
      },
      "source": [
        "我们看到这两组请求确实非常相似！在数据集中包含近重复项可能会对模型产生意想不到的影响，并且要小心不要将它们分割到训练/测试集中。从[常见问题解答](../faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)中了解更多关于处理数据集中的近重复数据的信息。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iefctl3rcw4S"
      },
      "source": [
        "### 非独立同分布问题（数据漂移）\n",
        "根据报告，我们的数据集似乎不是独立同分布的（IID）。数据集的整体非 IID 分数（如下所示）对应于一个统计测试的 `p 值`，该测试用于判断数据集中样本的排序是否与它们特征值之间的相似性有关。一个低的 `p 值`强烈表明数据集违反了 IID 假设，这是从数据集产生的结论（模型）推广到更大总体所需的关键假设。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "execution": {
          "iopub.execute_input": "2024-02-16T06:26:20.911817Z",
          "iopub.status.busy": "2024-02-16T06:26:20.911434Z",
          "iopub.status.idle": "2024-02-16T06:26:20.915049Z",
          "shell.execute_reply": "2024-02-16T06:26:20.914501Z"
        },
        "id": "oEMWOQQPcw4S",
        "outputId": "18eca4cd-2451-4850-960c-0bf1e35d9729"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "0.0"
            ]
          },
          "execution_count": 40,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "p_value = lab.get_info('non_iid')['p-value']\n",
        "p_value"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c6swPCnncw4S"
      },
      "source": [
        "在这里，我们的数据集被标记为非 IID，因为原始数据中的行恰好是按类别标签排序的。如果我们记得在模型训练和数据拆分之前打乱行，这可能是不重要的。但是，如果你不知道为什么你的数据被标记为非IID，那么你应该担心可能的数据漂移或数据点之间的意外交互（它们的价值可能不是统计独立的）。仔细考虑未来的测试数据可能看起来如何（以及你的数据是否代表你关心的人群）。在非 IID 测试运行之前，你不应该打乱数据（这将使结论无效）。\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uCoKXqBrcw4S"
      },
      "source": [
        "    如上所示，cleanlab 可以自动筛选出数据集中最可能的问题，帮助你更好地为后续建模整理数据集。有了这个短名单，你可以选择修复这些标签问题，或者从数据集中移除非语义或重复的示例，以获得更高质量的数据集来训练你的下一个机器学习模型。cleanlab 的问题检测可以与你最初训练的*任何*类型的模型的输出一起运行。\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qnncoRWUcw4S"
      },
      "source": [
        "### Cleanlab 开源项目\n",
        "\n",
        "[Cleanlab](https://github.com/cleanlab/cleanlab) 是一个标准的以数据为中心的人工智能包，旨在解决混乱的现实世界数据的质量问题。\n",
        "\n",
        "请考虑给 Cleanlab Github 仓库一个星标，如果你有兴趣，也可以参与到这个[项目](https://github.com/cleanlab/cleanlab/issues?q=is:issue+is:open+label:%22good+first+issue%22)中来，比如帮助解决一些简单的问题。。\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
